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Abstract 

^^ ■ We explore model-based techniques of phylogenetic tree inference exercising Markov invariants. 

Markov invariants are group invariant polynomials and are distinct from what is known in the 

literature as phylogenetic invariants, although we establish a commonality in some special cases. 

pL^ I We show that the simplest Markov invariant forms the foundation of the Log-Det distance measure. 

Ph ■ We take as our primary tool group representation theory, and show that it provides a general 

Q I framework for analyzing Markov processes on trees. From this algebraic perspective, the inherent 

symmetries of these processes become apparent, and focusing on plethysms, we are able to define 

Markov invariants and give existence proofs. We give an explicit technique for constructing the 

O"^. invariants, valid for any number of character states and taxa. For phylogenetic trees with three and 

four leaves, we demonstrate that the corresponding Markov invariants can be fruitfully exploited 

'!::;j- • in applied phylogenetic studies. 
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1 Introduction 

1.1 Background 

Molecular phylogenetic methods aim to infer the past evolutionary relationships of organisms 
from present day molecular data such as nucleotide sequences. Progress is made by making astute 
assumptions about the evolutionary process, which simplify the problem into a mathematical form, 
while retaining much of the structure motivating the biological question at hand. This process of 
mathematical modeling is essential if informed inferences from observed data sets are to be made. 

The most significant simplification made in phylogenetic models is that the evolutionary change 
of the molecular units is assumed to progress by mutation under environmental infiuences and the 
Darwinian effects of selection are ignored. Another overriding simplification, featuring in all the 
popular models, is that the effect of mutations is modelled as a stochastic (random) process 
assumed to be Markov. Also, it is often assumed that any given site in a molecular sequence 
evolved independently of the other sites, and the probability of mutation at each site is identically 
distributed (known together as the IID assumption). Although the IID assumption is known not 
to hold in many cases [58], we will assume throughout that IID holds, and defer modification of 
the results presented here to this more general case. 

Much progress in phylogenetic inference has been achieved in recent years with the use of so- 
phisticated mathematics, probability and statistical theory, and the advent of powerful computing 
techniques. A general rule that rates the scientific credence of a phylogenetic method is that model 
based techniques are preferred. In particular, some recent work has focused on the elucidation of 



the implicit model assumptions of popular methods such as Neighbor- Joining [21 HH] and Maxi- 
mum Parsimony '85]. This type of analysis is an essential part of the scientific justification because 
otherwise it is not exactly clear what is being estimated in the statistical sense. Without such a 
framework the biologist is left without any information regarding the confidence in the inference 
produced. 

An overlying difficulty in phylogenetic tree inference is that the number of possible trees is vast, 
and the space of trees is non-Euclidean; hence it is not clear how one should proceed in searching 
through it. It is normal to begin with a candidate tree and then consider each of its neighbouring 
trees (under a given adjacency rule) and choose the new tree as that with the best score. There 
is a range of available tree perturbation types such as "prune and regraft" or Nearest Neighbour 
Interchange, which define these adjacencies. Which type is preferable is a matter of ongoing debate 
[14U34| and such heuristic techniques sometimes find only locally optimal solutions. In this paper 
we will not discuss the problems associated with large trees, but consider how small trees may 
be built under general model assumptions. We give a general framework for constructing small 
trees, which can then be used as a springboard for building larger trees using techniques such as 
'quartet puzzling' [78] or supertree methods (for arbitrarily sized subtrees) [8l[89]. 

Due to its importance for calculating divergence times of lineages, the rate of mutation present 
in models of evolution is of central importance in phylogenetics. There are several well-known 
limitations of the standard models involving the rate of mutation on a phylogenetic tree. For 
instance, the IID assumption is almost always violated by the existence of site-to-site rate variation 
[67] and by the existence of invariable sites [56] . Other issues include non-stationary processes of 
evolution (which leads to 'compositional heterogeneity' [S]), 'pattern heterogeneity' (where the 
pattern of substitutions varies across the sites [S7]) and 'heterotachy' (differential rates across the 
tree) |57j . Ignoring the invalidity of the simple models when such assumptions are violated leads 
to model mis-specification [7S] and (potentially) incorrect tree inference. 

An issue for any inference technique is that of 'consistency'; where consistency is always with 
respect to an explicit or implicit model (or family of models) of sequence evolution. Statistical 
consistency requires that if the data set is sampled from a distribution generated under the model 
assumptions, then the inference method tends to the correct answer 100% of the time as the size 
of the data set (length of the sequences) tends to infinity. For example Felsenstein ^T] showed 
that Maximum Parsimony (MP) is statistically inconsistent (with all but a small family of models 
[74]). 

As exemplified by the first three chapters of the recent review book 25], the statistically 
consistent, model based phylogenetic methods can be placed into three categories: Minimum 
Evolution (ME) and distance based methods, Maximum Likelihood (ML), and Bayesian methods. 
ME proceeds by defining a (model based) matrix of pairwise distances between the molecular 
sequences, and then minimizes the total tree length across the space of possible trees subject to 
some statistical criteria such as least squares (see Chapter 1 of [25]). ML proceeds by maximizing 
the 'likelihood' of the observed data set across the set of possible trees and models of evolution 
[22], [25] . Bayesian methods proceed using Bayes' theorem to calculate a posterior distribution 
on the space of possible trees given a prior distribution (usually uniform-which is an issue in 
itself as this does not correspond to any evolutionary model of tree generation [93]). For each of 
these methods the underlying model assumptions are explicit, and current research efforts revolve 
around implementing these methods under expanded assumptions and/or in a computationally 
efficient manner. 

Another desirable feature of any phylogenetic method is that the model on which it is based 
should be defined by as few numerical parameters as possible. The issue of scientific content of 
a model and parameter counts is discussed by Steel [75] in relation to the effectiveness of MP vs 
ML, where it was stated that the "predictive power of the theory. . .tends to be drowned out in 
a sea of parameter estimation" . This is a fundamental problem in model selection for biological 
inference, and corresponds to what is known as the bias/ variance trade off of parameter estimation 
[TT]; which in turn equates to the problem of "overfitting" or "underfitting" a data set. From an 
information theoretic perspective, a given data set contains only so much information from which 
the numerical parameters of a model may be estimated. A model with many parameters may fit 



the data very well, in that the parameter estimates may be close to their true values, but the 
corresponding variances will be large because there are relatively few data points. On the other 
hand, the variance of the estimates of a model with very few parameters will be smaller as there 
are many data points to estimate each parameter, but in this case the model runs the risk of being 
badly mis-specified, so that the parameter estimates may be biased. In this light, the 'covarion' 
model 169] deals with the effects of invariable sites whilst introducing only one extra parameter, 
and the 'gamma' model [92| accounts for site-to-site rate variation, with only an additional two 
parameters. Other methods for coping with heterotachy, rate variation and pattern heterogeneity 
include the partitioning of data sets and mixture models [S7]. However, all of these methods suffer 
because, in the general case, the models must include an individual rate matrix (containing up 
to twelve parameters) and an edge length parameter for each and every edge of the phylogenetic 
tree. In [7D] it was recently noted that the task of phylogenetic tree inference often lies in a region 
where there are more parameters than data points. 

To reduce the number of parameters in phylogenetic models, the evolutionary process is usually 
assumed to be stationary and reversible, the rate matrices are assumed to have a certain form 
(such as the Jukes-Cantor model with one parameter, or the Kimura models with two or three 
parameters), and each edge of the phylogenetic tree is assigned the same rate matrix (for details 
on these assumptions, see [TOl IH]). To accommodate non-stationary processes and associated 
compositional heterogeneity, it becomes necessary to introduce many more parameters into the 
model. In this circumstance it then becomes desirable to use a technique based on a general 
model but without the need to estimate the numerical parameters. In this light, a matrix of Log- 
Det pairwise distances combined with the Neighbor- Joining algorithm [59] achieves statistically 
consistent tree inference under the assumption of a general model. However this technique has 
its own shortcomings as distance methods only consider pairwise sequence alignments, ignoring 
much of the information available in the data set, and has problems with model mis-specification 
[84] , and the statistical properties of the Log-Det are not exactly known [29] . A recently presented 
method |42j fits a very general model, but clearly will have issues with over-parameterization and 
computational requirements. 

In summary, the desirable features of a given phylogenetic method are that it is based on a 
general model of sequence evolution, it is statistically consistent with a family of known models, 
and the number of parameters to be estimated is minimal. 

1.2 Markov invariants 

In this work we introduce the use of mathematical representation theory to the problem of phylo- 
genetic inference (further background to the results is presented in the PhD thesis [8j ) . We define 
'Markov invariants' and show that these functions, when evaluated on sequence data, can be put 
to work in the problem of phylogenetic tree inference under rather general model assumptions. 

Markov invariants are distinct from what is known in the literature as 'phylogenetic invariants' 
[131 [T9l [STl [77] . Markov invariants are a particular case of group invariant functions [66] and are 
hence more constrained by definition than phylogenetic invariants. Some Markov invariants are 
simultaneously phylogenetic invariants, but the reverse is not true in general. The structure of 
Markov invariants is more akin to that of the Log-Det function 52, 59J, which is constructed using 
the simplest example of a Markov invariant, yet it is not a phylogenetic invariant. 

The appeal of this approach is that Markov invariants do not assume any particular rate 
matrices or edge length parameters on the phylogenetic tree. Broad conditions of molecular 
evolution are thus accommodated, incorporating arbitrary substitution rates, non-stationary and 
time-inhomogeneous processes, heterotachy, and arbitrary pattern heterogeneity across the tree. 
Further, Markov invariants satisfy certain algebraic relations for particular phylogenetic trees, and 
can provide a novel method of tree inference. 

This approach to phylogenetic tree inference satisfies the desirable features given in the sum- 
mary above. That is, Markov invariants are valid for a general model of sequence evolution, 
statistical consistency is assured, and only a few parameters need to be estimated. 



In particular, for the quartet case, we give a tree inference routine, valid for these inclusive 
conditions, optimizing over only one parameter. It is hoped that, with additional understanding, 
this technique can be extended to larger trees. This will result in phylogenetic tree inference 
methods, valid for general models, that make use of only a few parameters. Such a possibility is 
very attractive, as all of the data is utilized, and a general model may be assumed with the risk 
of overfitting significantly reduced. 

In this paper we outline the theoretical background required to understand the derivation 
of Markov invariants. This will necessitate, in iJJJ an excursion into elementary measure theory 
on finite sets, and the construction of 'phylogenetic tensors'. In iJ3] we analyse certain groups 
affiliated with the Markov process, and review standard results from group representation theory. 
This section concludes with a derivation of existence conditions for Markov invariants. In ^ we 
report on the structure of Markov invariants for phylogenetic trees with three and four leaves, and 
give examples of how they can be incorporated into practical phylogenetic analyses. 

2 Measure theory, the Markov semigroup, and phylogenetic 
tensors 

In §2.11 we collect some basic properties of measures on finite sets, justifying the use of tensor 
product spaces in the context of Markov processes on phylogenetic trees. The results are rather 
elementary, but ultimately necessary to place the subsequent discussion on its proper footing. 
See, for example, [30j for an introduction to measure theory. In i i2.2l we use generating function 
techniques to calculate expectation values of various random variables (and functions thereof) 
associated with phylogenetic data sets. We give a simple example and show how to compute its 
unbiased estimator. We define the 'Markov semigroup' for the general time-inhomogeneous process 
f §2.3p . construct 'phylogenetic tensors' ( §2.4p . and, finally, define Markov invariants ( §2.5p . 



2.1 Probability measures on finite sets 

Consider a finite set labelled by natural numbers, K ^ {1, 2, . . . , k}. A probability measure on K, 
is a function /x : K — *■ [0, 1], such that, for any proper subset AcK and any sequence Ai,A2,... 
of pairwise disjoint subsets, the following conditions hold: 

/x(0) = 0, 
KA) < 1, 



^(u^O -E^(^*)' 



^iiK) = 1. 

We denote the set of probability measures on K as Ai{K). It follows from the third condition 
that for l<i<k the measures, Si{A) = l if i e A and otherwise, form a basis such that 

k 

1=1 

for all /i e M{K) with /ii :— fj,({i}). This definition is equivalent to the usual requirement of a 
probability distribution on a finite set: 

k / '' \ 

i=l \i=l / 

In phylogenetics the data sets under consideration are aligned sequences of molecular units. For 
example, in the case of DNA made up of the four nucleotides adenine, cytosine, guanine, thymine, 



we would have K = {A, C,G,T}, k — A and write ii' = {1, 2, 3, 4}. However, the resuhs presented 
here and in iJ3] are vahd for any k. In fJJlwe will concentrate on cases relevant to phylogenetics 
and investigate the Markov invariants for k — 2,3 and 4. 

In this work we do not consider the problem of aligning the sequence data, and assume through- 
out that the 'true' alignment (without gaps) can and has been found (where truth is relative to the 
modelling process). Under this circumstance, it becomes necessary to consider the direct product 
of K with itself ni times: 

K"' := x'^'K = K xK X ...xK 

with \K"^\ — k"\ Exactly as above, for any proper subset E C K™ and any sequence of pairwise 
disjoint subsets Ei, E2, . . ., a probability measure, /i G Ai{K"''), must equivalently satisfy 

m(0) = o, 

KE) < 1, 




Given that under a measure unions decompose into summations, it follows that we have the tensor 
product: 

MiK"") = (E)"'M(K) := M{K) ® M{K) ® . . . « M{K). 

Concretely, any subset of iiT™ can be expressed as a union of disjoint subsets of the form 

Aix A2 X ... X Am, 

with Ai,A2, . . . ,Am Q K. A basis for ®"^M{K) is then, for 1 < ii, 12, ■ • ■ , im < k, 

5^^®6i^®...®5^^{Al XA2X...X A,n) '.^ 5^^{Al)5,^{A2) . . . 5^^{A,n) , 

with 5i^{Ai)5i^{A2) ■ . ■ 6i^{Am) = 1 if {ii}x{i2}x. . .x{ijn} & AiX A2X . . .x A^ and otherwise. 
We index the elements {ii} x {12} x . . . x {im} as 

/ = 11*2 ■ ■ -im, 

and write 

/i/ = ^ii^i2...i„, := M({n} X {«2} X ... X {i„i}). 

We refer to m as the rank of the tensor fi. 

Previously the authors JGS and PDJ have presented probability distributions on phylogenetic 
trees in a tensor product formalism motivated from analogies to quantum physics [401 [83' . The 
formulation presented above places this construction on its proper measure-theoretic footing]]. In 
H2.4l we will relate a given (Markov) model of evolution on a phylogenetic tree with m leaves, to 
a unique rank m tensor P € (S)"^Ai(K). 

2.2 Random variables, generating function, expectation values and es- 
timators 

Any data set considered in a phylogenetic study is necessarily of finite extent, and we suppose 
that it is a sample drawn from some unknown distribution. We wish to define expectation values 
of such data (or events) and functions thereof. Throughout we will assume the IID assumption 



^We are indebted to Michael Baake for drawing our attention to this. 



holds, so that we need only consider the distribution of a single random variable. The probability 
of observing a particular state at a given site will be identical for all the other sites. 

For a set of m aligned sequences of length N, define a pattern to be the (ordered) set of states 
read across the m sequences at a particular site in the alignment. That is, a pattern takes the 
form I = 1x12 ■ ■ ■ inn where ia is the character state in the a*'' sequence. Define the random variable 
X as the pattern observed at a given site. A probability distribution for X can be specified using 
a probability measure /i G ®"^ M{K): 

P[X = iii2 •■ -im] = Mn»2...i™- (1) 

For a sequence of finite length iV, define Z as the random variable that counts the number of 
occurrences of each pattern I = iii2 ■ ■ -im in the alignment, so that 



(Z/) — (ZiiJ2...i„Jl<n,i2, 



l<fc7 



and X^/eAT'" ^i = ^ ■ Assuming that each site in the alignment is identically and independently 
distributed as ([T]), it follows that Z is multinomially distributed under the measure /i: 

This expresses, under the assumptions of /x, the probability of observing within the alignment of 
m sequences the specific number of occurrences of each of the possible character patterns Z = z. 
When we describe Markov invariants, we will need to discuss expectation values of the random 
variable Z and functions thereof. For any function (/>, the expectation value with respect to the 
measure /i is defined as 



E[<l){Z)]:=Y,4>{z)nZ^z-N], 



with the summation over all z such that X^/gk™ zi = N. 

Remembering that Z follows a multinomial distribution, it is in practice necessary to use 
generating function techniques in order to calculate these expectation values. The generating 
function on the formal variables s = (sj) = {si^i^,,,i^)i<:i^^i^^,,,^i^<k of the multinomial distribution 
is 



with 






{s,Z) := ^ siZj. 



N 

(2) 



/ex™ 

From the properties of the exponential function and the commutivity of differentiation and expec- 
tation, 



dG{s) 



^^iii 



[^1*2---«T1 



s=0 



Using the above closed form of the generating function, an elementary calculation returns 

as of course would be expected. This can be extended to find the expectation of any function of 
Z: 



As a concrete example, take m — 2 and consider the case (l){Z) = Z^^ — ^12^13. From the 
hnearity of the expectation values we have 

-^[■^44 ~ -^12^13] = -£-[^44] ^ E[Zi2Zi3], 

so we can consider each term in turn. Taking derivatives of the closed form of the generating 
function gives 

E[Zl^] = N{N - 1)mL + N^i^i 

and 

^[^12^13] = N{N - l)/xi2Ati3. 

Thus, in this case, the expectation value of (p is 

E[(j){Z)] = N{N - 1)04 - M12M13) + A^A^44. 

Given a (possibly unobservable) random variable 9, an estimator is another random variable 
which is a function of observable quantities such that its expectation value somehow approximates 
6. The bias of an estimator 9 is defined as the difference 

b{9) = E[9] - E[9], 

allowing for 9 to simply be a constant so that E[9] = 9. An unbiased estimator is simply an 
estimator with bias equal to zero. For example, a short calculation reveals that the unbiased 
estimator of 0(/i) above is 

(t){Z) - Z44 
N{N - 1) ' 

In general, if (f> is polynomial, computing an unbiased form is a straightforward matter of 
solving a sequence of difference equations. When it comes to discussing estimators for Markov 
invariants, we will show that unbiased forms can easily be defined. However, we will note that 
explicit computation is difficult due to a required change of basis. 

2.3 The Markov semigroup 

A stochastic process can be described by introducing a time-dependent random variable X(t). A 
crucial component of the subsequent discussion will be that the time evolution of the corresponding 
probability distribution can be viewed as a linear mapping upon a vector space. Presently we will 
establish the conditions for a Markov process, and show that such a process satisfies the desired 
property. See, for example, [38] for an equivalent derivation. 

Consider a time-dependent, finite-state random variable, X{t), taking on values in K, any set 
of times ti < t2 < ■ ■ ■ < tn < t, and the joint distribution of X across those times: 

P[X{ti)=ii,X{t2)=l2,---,X{tn)=in,X{t)=i]. 

The distribution of X at the particular time t is given by the marginal, 

¥[X{t) = i]= J2 nXiti) = ii,X{t2) = i2,---,Xit,-,)^tn,X{t) = tl 

and this can be re-expressed by invoking the conditional distribution: 

P[Xit) = i]^ J2 P[X(i)=^|X(tl)-^l,X(t2)-^2,...,^(t«)=^„] 

l<.ii,i2 ■,■■■, in ^k 

■P[X{ti)^H,X{t2)^i2,---,X{tn)^i]. 



The simplest stochastic process is the process for which the probability of a transition to a new 
state at a given time is independent of the states at all preceding times (such as tossing of a 
coin-the Bernoulli process). A Markov process can be seen as the next simplest case where the 
probability of a transition is independent of all but the state at the most recent time. Thus, for a 
Markov process the conditional distribution satisfies 

This implies that the marginal distribution of X at the time t is 

F[X{t)=i]= J2 nx{t)=i\X{U,)^in] 

l<i„<k 

^ P[X{ti)=n,X{t2)^i2,...,X{tn)=in] 

l<.ii ....,in-i <A; 

[X{t)=l\X{tr,)=ln]nX{U,)^l„]. 



l<i„<k 

Introducing the time-dependent measure /i* with /i*({i}) :=fil=F[X{t) = i], we can express this as 

for all s < t, and for Mij{t, s) := P[X{t) =i\X{s) =j]. If we consider the {Mij{t, s))-^^- ■^^. as the 
matrix elements of a linear operator M{t,s) acting on the vector space M.'^ D M-{K) with basis 
elements 61,62, ■■■ ,Sk, we see that, as promised, for a Markov process the time evolution of the 
probability distribution is given by a linear map on R'"' defined by its action on time-dependent 
probability measures: 

l^ ^ l^ , (3) 

fi* ^ M{t,s)fi'. 

This linear map describes the general timc-inhomogeneous finite state Markov process and can 
easily be extended to the whole of M'"'. 

In [83j JGS and PDJ considered stochastic matrices as elements of the general linear group, 
and used this property to study the structure of invariant polynomials (used as measures of 
entanglement in quantum physics) when evaluated on a phylogenetic tree. Presently we will 
define the Markov semigroup which serves to refine the definition of invariant functions to the 
more relevant case of a stochastic (but linear) time evolution. 

Define the time-dependent rate matrix, Q{t), as a (continuous) one-parameter family of lin- 
ear operators on the vector space A4{K), which in the 61,62, ■■■ ,6k basis has matrix elements 
satisfying: 

Q,j{t) > 0, Vi^ j; Quit) = - Z QJ^{t). 

The summation conditions can be equivalently expressed by defining the vector 6 = 61 + 62 + - ■ ■+6k 
and its transpose 9^ , and setting 

e'^Qit) = 0, 

for all t. 

The Markov semigroup on k elements, 9Jl(fc), with parameters < s < t < cxd, is defined as 
the subset of (differentiable) two-parameter linear operators on M{K) which satisfy 

M{t,s) = l, yt^s; 



the Chapman-Kolmogorov equation: 

M{t,s)M{s,r) = M{t,r), Vr < s; 
and the backwards and forwards equations: 

«^ = -M(M)Q(.), 
^y ' ^Q{t)M{t,s); 

for any rate matrix Q{t) [27l [38]. Solutions of (|4]) can be represented using the time-ordered 
product (or ordered-exponential): 

M(t,s) =Texp Q{u)du (5) 

J s 

[551 Chap. 4], from which it foUows that 

dctAf(i,s) =cxp / tr{Q{u))du, (6) 

and 

The time-ordered product is best understood by considering the approximation 

M{s + 2e, s) = M{s + 2e, s + e)M(s + e, s) ~ e'5(^+^)^e'3(")^ 

By considering ((?]) for the case t — s, it foUows that in the 61,62, ■ ■ ■,6k basis, the matrix 
elements of each M{t, s) lie in the interval [0, 1] for all s < t. Thus, the Markov semigroup 
corresponds to the subset of the set of stochastic matrices subject to the condition that for each 
matrix there exists a rate matrix (or generator) Q{t) such that ([5]) is satisfied. We refer to elements 
of the Markov semigroup as Markov operators. 

In the time-homogeneous case where the rate matrix is time-independent: 

Q:=Q{t)=Q{0), 
it follows that M{t, s) is dependent only upon the difference {t — s), and form (O becomes simply 

0<n<oc 

In iJ3| we will discuss some representation-theoretic properties of certain groups affiliated with the 
Markov semigroup. 

2.4 Phylogenetic tensors 

A tree, T, is a connected graph without cycles and consists of a set of vertices and edges. Vertices 
of degree one are called leaves. We work with oriented trees, which are defined by directing each 
edge of T away from a distinguished vertex, p, known as the root of the tree. Consequently, a 
given edge lying between adjacent vertices u and v is specified as an ordered pair {u,v), where u 
lies on the unique path from p to v. A cherry is a pair of leaf vertices with the same parent vertex. 
Assign a random variable, Xy, to each vertex of the tree, and, as described in [751 Chap. 8], a 
joint distribution of the random variables at the leaves is determined by specifying a distribution 
ttGM{K) at p and a Markov operator M"'"e9Jl(fc) for every edge {u,v). In particular, for every 




Figure 1: Phylogenetic tree with two leaves 

V, the random variable Xy is conditional on only the random variables lying on the path from p 
and V, and for each pair of vertices vi, V2 with common parent u, the joint distribution of Xy-^ and 
X„2 is given by 

l<j<k 

where fi" is the distribution of X„. The empirical interpretation of the joint distribution across 
the leaves is that of a sampling distribution from which an alignment of molecular sequences is 
constructed by drawing one character pattern at a time. Throughout this paper, we will consider 
phylogenetic trees where the root distribution and the Markov operators are arbitrary. 

Note that we insist that the Markov operators belong to the Markov semigroup, so a continuous- 
time process is in action throughout the tree. This additional analytic structure means that this 
model is slightly less general than the general Markov model as defined in [Ij I41j . The general 
Markov model allows for arbitrary transition matrices with positive entries and unit row-sum (unit 
column-sum in our formulation), and it is not hard to find a matrix satisfying these conditions 
but with determinant less than or equal to zero, directly contradicting ([6|). 

We will consider a joint probability distribution on m leaves as a probability measure, P € 
&"M{K), that we refer to as a phylogenetic tensor. Presently we review how these tensors can 
be constructed using purely algebraic operations. 

The branching process ([7]) can be interpreted as a map that takes probability measures on K 
to probability measures on KxK = K^. In [40l[83], it was shown how to formalize this by defining 
the linear operator 

S : M{K) -^ M{K)® M{K). 

Demanding the conditional dependencies that are required for the standard definition of a tree 
distribution ^721 Chap. 8], we have (expressed in the (5i, 52, . . . , (5^ basis) the specification 

5 : 5i^^ 5i®5i, 1 < i < k. 

The phylogenetic tree with two leaves (Figure [T]) can then be represented as the string 

P = (Afi ® M2) • {S ■ tt), 

where Mi and M2 are the Markov operators on the two edges of the tree and, if Xi and X2 are 
the random variables at the leaves 1 and 2, respectively, we have 

nXi^i,X2=j] = P,, := P{{i}x{j}). 



This construction can be generalized to any phylogenetic tree by colouring the root of the tree 
with a distribution tt, each internal vertex (including the root) with a branching operator 6, and 
every edge with an arbitrary Markov operator. The phylogenetic tensor is constructed by beginning 
at the root of the tree, and then recursively moving to the child vertices and applying the relevant 
operators to the corresponding slots in the (growing) tensor. Whenever a leaf is encountered, 
continually apply the identity operator at that leaf, until all leaves have been reached and the 
phylogenetic tensor is complete. A phylogenetic tensor, P, is then represented as a string made 
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Figure 2: Constructing the phylogenetic tensor for a four taxon tree 



up of the characters tt, Mi,M2, . . . , and S, and the joint distribution of the random variables 
Xi,X2, ■ ■ ■ , Xjn at the leaves 1, 2, . . . , tti is given by 

¥[Xi=ii,X2^i2, ■ ■ ■ ,X,n = im] = ^Jii2...«™ :=^({Ji}x{i2}x... x{i^}). 

For example, the phylogenetic tensor of four leaves (Figure [2|) is represented by the string 

P = (1 1 (g) M3 » Mi) ■ (1 (g) 1 5) • (1 M2 (g) Afg) • (1 5) • (Ml Me) ■ {S ■ tt), 

and is constructed in the steps 

TT ^ ^ • TT -> (Ml ® Me) • ((5 • tt) ^ (1 5) • (Ml ® Mg) • ((5 ■ tt) 
^ (1 g) M2 «) Ms) • (1 (g) J) • (Ml (g) Me) • (J • tt) 

^ (1 ® 1 g) 5) • (1 g) M2 (g) Ms) • (1 g) 5) • (Ml g) Me) • ((5 • tt) 

^ (1 g) 1 g) M3 (g) Mi) • (1 g) 1 g) J) • (1 g) M2 g) Ms) • (1 g) 5) • (Ml g) Me) • (5 • tt). 

In order to define Markov invariants, we must also define two reduced tensors based on P, the 
trimmed tensor P and the pruned tensor P* . These are both constructed by modifying the 
underlying tree. The trimmed tensor P is constructed by taking P and setting the Markov 
operators on the pendant edges all equal to the identity operator, or equivalently setting the 
lengths of the pendant edges to zero. The pruned tensor P* is constructed by removing all 
cherries from the trimmed tensor. The rank of the pruned tensor is {m — c) where c is the number 
of cherries on the underlying tree. 

In the general case, we can relate P and P as 



P = (Ml M2 ig) 



M„) • P, 



(8) 



where Mi, M2, . . . , M,„ are the Markov operators on the leaf edges. In what is to come, we will 
continually use this relation. 

As an illustration of the relationship between P, P and P*, take the seven leaf tree (Figure [3]), 
with phylogenetic tensor given by 



P = (1 g) M2 g) M3 g) M4 (g) Ms g) Me g) M7) • (1 (g) 5 g) <5 g) (5) 

• (Ml g) Ms g) Mg g) Mio) • (5 g) J) • (Mil 



M12) • ((5-7r). 
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The trimmed tensor corresponding to the tree (Figure 3]) is obtained by cHpping off the pendant 
edges: 

P == (1 ® (5 «) (5 (g) (5) • (1 (g) Ms (g) Mg » Mio) • ((5 g) (5) ■ (Mn g) M12) • {S ■ n), 

and, finaUy the pruned tensor corresponding to the tree (Figure [S]) is expressed as: 

P* = (Mg g) Mg g) Afio) • (1 (5) • (Afii g) M12) • {S ■ tt). 



2.5 Markov invariants, definition 

With the form ([8]) in mind, we define a Markov invariant of weight (wi, W2, ■ ■ ■ , Wm) as a function 
satisfying 

/(P) = (detMi)"'KdetM2)-^ . . . (det M„)-'"/(P), (9) 

for aU Ml, M2, . . . , Mm € dJl{k). We exclusively consider polynomial functions, and where wi — 
W2= ■ ■ ■ =Wm = w, the Markov invariant is said to be of weight w. 

Considering the above discussion of unbiased estimators of random variables, an unbiased 
estimator of a Markov invariant is a function, /, such that 

E[f{Z)] = f{P) = {detMir^idetM^r- . . . (det M„ )'"'"/ (P). 

Such an estimator depends, up to the multiplicative scaling factor, only upon the internal structure 
of the phylogenetic tree. It is exactly this property that can be productively engaged in the context 
of phylogenetic tree inference. 

Conversely, a phylogenetic invariant is a function satisfying /(P) = for all P belonging to 
the family of phylogenetic tensors arising from a particular tree (or subset of trees). In fJHwe will 
show that there exist Markov invariants for trees with three and four leaves that are simultaneously 
phylogenetic invariants. 

Given a Markov invariant, /, consider the induced function, /*, defined on pruned tensors and 
specified by evaluating the trimmed tensor: 

/*(P*) = /(P). 

This induced function is easily extended to be defined upon all of (g™^^A^(iir), where c is the 
number of cherries on the underlying tree of P. Such cases are of special interest for phylogenetic 
problems. In 2] we will review a case (reported in a less general context in [HI]) where this induced 
function is itself a Markov invariant. We expect that future investigations of Markov invariants 
will reveal more cases such as this. 

In §3. 51 we will establish existence conditions for Markov invariants using standard results from 
group representation theory. 




Figure 3: Phylogenetic tensor P for seven taxon tree ((1,23), (45, 67)) 
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3 Group representation theory in phylogenetics 

In this section we use the algebraic description of probability distributions on phylogenetic trees 
given in ^2A\ to establish natural connections with aspects of representation theory, for certain 
groups affiliated to the Markov semigroup. These are discussed in i )3.1l Then follows ('i )3.2l and 
H3.3P a brief outline of those aspects of the representation theory of the general linear group and 
its subgroups that are needed for the discussion of group branching rules. This leads to the 
construction of one-dimensional representations and their identification as invariants ft i3.4p . with 
existence conditions given in 



3.1 The Markov semigroup and affiliated groups 

The linear transformation ([3]) effected under the Markov semigroup on probability measures is 
closely related to certain group actions on the vector space R*^. Given that the corresponding 
representation theory is unchanged |i48_ . in this section we will generalize to complex vector space, 
(as with [T]). That is, here and below, for algebraic purposes we regard the 61,82, ■■■ ,5k as 
elements of a basis for V = C^ . Thus, the probability measures become a subset lying in the 
ambient complex space (gj^C*^ D (d"''M{K). For related considerations involving the study of 
invariants of stochastic matrices see [151 [SS] . 

Referring to ^ and noting that — cxd < tr{Q{t)) < for all t, the determinant of each element 
M{t, s) lies in the interval (0, 1], and the Markov semigroup occurs as a subset of the general linear 
group: 

m{k) C GL{k). 

GL{k) is the group of invertible linear operators on the fc-dimensional vector space C'^. The 
smallest subgroup of GL{k) that contains 2Jl(fc) is obtained by taking 37l(fc) together with all of 
its operator inverses. In order to apply known methods of representation theory, we will, however, 
not work with this group directly. We define a slightly less refined subgroup as the focus of the 
impending discussion. 

Generalizing the notation of |65], we define the subgroup GLi{k)<\GL{k) as the subset oiGLik) 



13 



whose matrices in the 61^62, ■■■ ,Sk basis have unit column-sum. That is, for aU g e GLi[k): 

The group property clearly holds, as for all 31,52 G GL\{U): 

0^i9ig2) = iO^gi)92 = O^g2^e^. 

This group is isomorphic to the complex affine groupQ 

Gi(fc-l) K T(fc - 1) = A{k - 1), 

where T{k — 1) is the group of linear translations on C''^^. As shown in Appendix \X[ this 
isomorphism is due to the column-sum condition being, in effect, a statement that the group 
elements are dual to linear transformations in fc-dimensional complex space, leaving a fixed vector 
invariant. 

Consider also the doubly- stochastic Markov semigroup, 9Jl*(A;), obtained by requiring an addi- 
tional condition on the rate matrices: 

Q{t)0 = 0. 

The associated subgroup of the general linear group is then denoted as GLi^i(k); the subgroup of 
matrices in GL[k) which have unit column- and row-sum with, for all g £ GLi^i{k): 



Again the group property can easily be shown to hold. Thus the doubly-stochastic Markov semi- 
group is naturally affiliated to the associated group GLi^i(k) which, also as shown in Appendix [A1, 
itself is isomorphic to GL{k—l). 

To summarise, consider the subgroup chain: 

GL(k-l) ^ GLi,i(fc) < GL{k-l) K T(fc-l) = A{k) ^ GLi{k) < GL{k). (10) 

and the set inclusions: 

m{k) C GLi{k), 
mt*{k) cGLi,i(fc). 

We now have a clear picture of how to develop the representation theory of the Markov semi- 
group which focuses on algebraic properties and avoids the analytic details due to the positivity 
requirement and semigroup property. This is the correct framework in which to exploit the Schur- 
Weyl duality (i i3.2p and, considering the above inclusions, all results presented will be valid for 
the Markov semigroup. The above subgroup chain will feature in i i3.4l where we give existence 
conditions for Markov invariants. 

3.2 Representations of GL{k) and Schur-Weyl duality 

Our purpose here is to show that the close relation of the Markov semigroup to affiliated subgroups 
of the general linear group allows the machinery of representation theory to be applied in analysing 
the models used in phylogenetic inference. 

From the form of the general Markov model on phylogenetic trees given earlier ([8]) , it is evident 
that the representation-theoretic considerations must be extended to tensor products. We now 
provide some standard results within this setting (see, for example, [211 Lecture 6]). 



^The symbol K denotes the semi-direct product of two groups Z, Chap. 1]. The standard physical example is 
the Euclidean group, which occurs as the semi-direct product between rotations and translations in M" . These are 
none other than the set of transformations that define Euclidean geometry. 
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For GL{k) and its classical subgroups it is well known that for the defining representation 
on 1/ = C*^, with V ^^ gv, extended to a reducible representation on (^"^V in the obvious way, 
vi iS) V2 ^ ■ ■ ■ ^ Vm >-^ gvi (8) gv2 ® . . . (8) gvm, there is a direct sum decomposition, 

0™F = Y. ®^^^ (11) 

into (possibly reducible) subspaces V^. These subspaces (or modules) are labelled by integer 
partitions, A = (Ai, A2, . . . , A„), of m, the Ai being nonzero and nonincreasing and such that 
Ai + A2 + . . . + A„ = m. If A is a partition of m, we write A h m and |A| =m. The corresponding 
module V"^ is determined by a unique projector on ^™V; the Young's operator Y^. The f\ are 
integer multiplicities determining how many times each module occurs in the decomposition. The 
Schur-Weyl duality is the classic result that each f\ is none other than the dimension of the 
irreducible representation associated with the same partition A of the symmetric group &m- This 
reflects the role of the symmetric group's action on ^'^V by permuting basis elements across the 
tensor product, when constructing the Young's operators. 

The character of a representation is defined as the set of traces of the representing matrices; 
one for each group element. The irreducible representations of a group can be enumerated by 
solely considering the corresponding irreducible characters. Thus the problem of decomposing a 
representation into irreducible modules (computing the multiplicities fx) can be performed at the 
level of the characterfjj. 

For instance, in the case of GL{k) itself, the V^ are irreducible, with character given by the 
celebrated Schur functions, s\, with 

sx{x) = tr(7rA(.9)), 

where T^x^g) is the representing matrix for group element g and Xi,X2t ■ ■ ■ ,Xk are its eigenvalues. 
The Schur functions are defined in their own right, and are uniquely determined by the semi- 
standard tableaux corresponding to the partition A [50] . 

The defining fc-dimensional representation in this notation is V^^^ = C'^, in which case the 
Schur function is S[iy{x) = xi + X2 + ■ ■ ■ + Xk- The Schur functions form a basis for the ring of 
symmetric functions on any number of variables, and the trace is a symmetric function. Hence, 
the problem of identifying the irreducible GL{k) modules in the above representation on (g)™^, 
reduces to identifying the Schur functions in the decomposition of the character with respect to 
this basiqj. 

A convenient and standard notation for Schur functions is given by enclosing the partition 
(or parts thereof) in braces [Mj- Thus {A} and {1}, are the Schur functions corresponding to a 
general irreducible and the defining representation of GL{k) respectively. For simplicity, we write 
7r{i}(ff) =9- 

For classical subgroups of GL{k), the modules V^ are no longer necessarily irreducible, and 
further combinatorial considerations (not required here) are needed to effect a complete reductiorO. 
More importantly, for GL{k) itself with V not the defining, but an arbitrary module, V^ say, the 
equivalents of the above modules, (V)'^, are again no longer irreducible in general. 

This construction introduces a fundamental operation for combining representations together; 
that of plethysm [54]. The character of (VP)^ is denoted {p}^{A}; the plethysm of {p} by {A}. 
In the simplest case {p} is the character for the defining representation, {!}, and by definition 
{IMA} = {A}. 

In general, for any symmetric functions A, B we have {p}®_{A + B) = {p}®A + {p}^B, and 



^Within the context of phylogenetics, see IS2I for an unrelated discussion of the irreducible characters of the 
symmetric group. 

*The stronger statement that the V"^ provide the complete set of irreducible modules of any integral represen- 
tation of GL{k) is valid [48j. 

^The classical subgroups of GL(k) are constructed by requiring, under the group action, the invariance of bilinear 
forms on V. 
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we recover 



{p}^{ J2 f>^^^^) =^p^ ^{p} <»■■■'» {p}, 



X\-m 

where {p} (E> {A} denotes the (commutative and associative) pointwise muhiphcation of the Schur 
functions, 

(M^{A})(^):-M(x)-{A}(x), 

and the Schur functions occurring in the decomposition of {p} {A} correspond to partitions of 
IpI + |A|. This of course reflects ([TT]) with V replaced by V^: 

y ® y . . . (g) y = ^ ®fx{vp)^. 

X\-m 

In particular, for rank 2 we have 

which at the level of the characters is described completely by 

{A}®{A}={AM2} + {AMl^}. 

This is the well-known decomposition of a representation into its symmetric and anti-symmetric 
Kronecker square, respectiveljo. 

Although it is a difficult task to evaluate the general plethysm (see [BU] for a review of sym- 
metric functions and their various manipulations), in practice all required operations of symmetric 
functions involving products, plethysms and group branching rules can be evaluated symbolically 
using an appropriate group theory package. Where required, we use Schur [HT] for this purpose. 

From ([8]), which gives the form of the phylogenetic tensor for a tree with m leaves, it is clear 
that the appropriate representation space to consider is indeed ^'"C'^, regarded not as a module 
of GL{k) as above, but rather carrying an irreducible representation of the action of the direct 
product group x™GL{k) — GL(k) x GL{k) x . . . x GL{k). That is, considering that a phylogenetic 
tensor lies in the ambient space (gi™C'^, the generic analogue of ([5]) is 

V'' = (51 ® 52 ® • ■ • ® 5m) • V', (12) 

where ip £ (g)'"C'^. In a phylogenetic setting, we must allow for differing Markov operators to 
act on each edge; hence the above form. It is usual in phylogenetics to take a fixed rate matrix 
for all edges, and allow the edge lengths to vary, thus creating different Markov operators from 
identical generators. In fact, the above generalization of the group action allows for differing 
Markov processes on every edge of the phylogenetic treo 

A complete representation-theoretic analysis incorporating the tree structure of phylogenetic 
tensors is a topic for future research, and we defer such a theory. We concentrate on analysing 
the group action defined by ([8|) , leading toward the derivation of Markov invariants while ignoring 
the underlying tree structure. In S]4]we will introduce a post-hoc procedure which allows the tree 
structure to be incorporated. This will allow Markov invariants to be applied in a practical setting 
without the need for the complete theory. 



''In the context of quantum physics, this corresponds exactly to the statistical properties for ensembles of bosonic 
and fermionic particles, respectively. 

'^ Under a (somewhat biologically unsound) model in which the evolution along the pendant edges occurs with 
identical transition probabilities, the group becomes the diagonal GL{k) subgroup of the m-fold direct product 
group, and the representation reduces accordingly, precisely as in the initial discussion above. 
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3.3 Representations of x"^GL{k) 

Here we derive the group branching rule which is required to identify the irreducible modules 
under the group action (fT^ . 

There is yet another product of symmetric functions; the inner product, defined as 

{A}0{p} = E7ApW' 

where |A| = \p\ = |cr| = n, and the 7J are the integer multiplicities of occurrences of the a 
representation in the Kronecker product representation between A and p of the symmetric group 

e„ [55]. 

Consider the direct product group GL{k) x GL{£), with group action on Vi V2, where Vi 
is fc-dimensional and V2 is i'-dimensional, defined by vi (Ei V2 ^^ gi^i CE5 g2V2- If the eigenvalues of 
giy 92 are xi,X2, ■ ■ ■ ,Xk and yi, j/2, • ■ • , 2/£ respectively, then the character of this representation is 
the product 

{l}{x) ■ {l}{y) - (xi + . . . + Xk){yi + ... + yi) = {l}{xy), 

with 

{xy) = {xiyi,xiy2, ■■■, X2yi, ■■■, Xkyi)- 

Generalizing this result, consider the natural embedding, GL{k) x GL{£) C GL{ki), and the 
{A} representation of GL{k£) restricted to the direct product group: ^ 1-^ '^\{gi x 32)^ with 
^ G (Vi ¥2)^- The character of this representation has decomposition 

{\}ixy)^ E 7p<xM(a;)-W(y); (^3^ 

P,ctI-|A| 

for details see [50l[88]. Thus, we see that the inner product plays an essential role in decomposing 
representations of the direct product group GL{k) x GL{£) into tensor products of irreducible 
modules of GL{k) with irreducible modules of GL{£): 

pMi-\\\ 

Presently we will use this result to derive branching rules for the group action that is relevant to 
phylogenetics ([5]). 

In its general setting, a (group) branching rule describes the decomposition of a representation 
of a group, G, when restricted to a subgroup, H C G (written as G [ H) |87l Chap. V, §18]. 
For the present purpose, we consider x"^GL{k) as a subgroup of GL{k"^), and given the defining 
representation of GL{k™), the corresponding branching rule is 

GL(fc'") i x"Gi(fc) : {1} ^ {1} ® {1} ® . . . ® {1} = «"{!}. 

On the left-side of the arrow, {1} denotes the defining representation of GL{k™'), whereas on the 
right-side, {1} denotes the defining representation of GL{k). 

If we take the generic {A} representation of GL{k"^), the appropriate branching rule ici 

{(Ti}O{o-2}0...O{o-„}3{A} 

GL(fc'") i x'"GL(fc) : {A} ^ ^ {ai} ® {aa} . . . {a„}. (14) 

cTi ,a"2 ,...,a"TTxh| A 

This result can be confirmed using the identity p3p . 



*This is a special case of a more general embedding {1} -^ {Ai} ® {A2} (S) ■ ■ ■ (i) {Am}, for which each {ui} in 
the decomposition is replaced by the appropriate plethysm {Ai}®{cri}. For a recent discussion of the calculus of 
plcthysms see |20| . 
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The branching rule (J14p gives the decomposition of irreducible modules of the time evolution at 
the pendant edges of a tree, as induced by ([5]), but considered as the representation of x"^GL(k) C 
GLik"") defined by 

*' = 7rA(.9i X 52 X ... X g„) •*, 

for ^ G ((8)'"C'=)^. 

In the setting of phylogenetics, we show in iJ3.4l that specializing to {A} = {d} gives the decom- 
position of (homogeneous degree d) polynomials of phylogenetic tensors. In a practical setting, 
this corresponds exactly to taking (polynomial) transformations of the observed data set of char- 
acter pattern counts. That is, recalling that the expectation value of character pattern counts in 
a sequence alignment is governed by a joint distribution on a tree corresponding to a phylogenetic 
tensor P, the above branching rule tells us how arbitrary polynomial functions of the character 
pattern counts decompose into components which transform among themselves under the time 
evolution given in ([8]). In addition to what is presented here, this potentially has application to 
any analysis involving pattern counts in molecular sequence data (see fJ5] for further comments on 
this matter). 

In ^3.5\ we will exploit the branching rule directly, defining the one-dimensional modules in the 
decomposition (|14p as invariants, and give existence conditions for Markov invariants. We must 
first establish the isomorphism between homogeneous degree d polynomials on a vector space V , 
and the module V^'^^. 

3.4 Symmetric plethysms and invariants 

Associated with any representation y of a group G is the so-called coordinate ring P{V) of 
polynomialflj over C in the components vi,V2, ■ ■ ■ ,Vk, corresponding to a given basis for V. For 
such polynomials, f{v), there is a natural group action, 

f{v)^9-fiv):=fig-'v). 
There is an isomorphism between the ring ViV) and the symmetric tensor algebr4lj \/{V): 



oo 
i=0 d=0 



V{V) ^ V V\V) = V(y) ^ V y\v), (15) 



with V {V) ^ yf''^ and V'^{V) denoting the homogeneous polynomials of degree d. This refiects 
that an arbitrary homogeneous polynomial of degree dink indeterminates can be specified by an 
array of determinates /iii2...i^ which is symmetric under permutation of indices: 

Our interest in the above construction lies in the invariant ring, ViV)'^ , of polynomials that are 
invariant up to a multiplicative factor under the action of G, or more generally, for any subgroup 
H<G, 

f{hv) = det{hrf{v), (16) 

for all ft. G i? and v G V. For matrix groups the multiplicative factor is the determinant with w 
denoting the weight of the invariant. Using the isomorphism (jlSp . the identification of a linear 
basis of such invariants of degree d reduces to the identification, in the reduction of the V^"^^ , of 
the one-dimensional representations of H in the branching rule G i H. 



^T'{V) is the ring of polynomials in the basis elements, 5i,52i ■ • • iCfci of the dual space V* so that 'P(V) 
CKi,6, ■ ■ ■ ,S,k] with Ci((5j) = Sij for all 1 < i,j < k. 

^"See [28. Chap. 4] for a discussion of the symmetric tensor algebra. 
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In particular, the one-dimensional representations of GL{k) occur as follows. Note that the 
dimension of a representation is equal to the trace of the representing matrix of the identity. For 
the irreducible module V^ this is given by sa(1, 1, ■ • • , !)■ Thus, for a one-dimensional module, 
the corresponding Schur function must be monomial and (considering the definition of the Schur 
functions using summations over semi-standard tableaux given in [601 ) this occurs only for par- 
titions of the form {r^} for any integer r > 0. Additionally, considering that one-dimensional 
representations act by simply multiplying by the character itself, and that 

S{ri'}{x) = (xia;2 .. -XkY , 
= det(<7)^ 

we see that, for any ^ eV^^ \ we have 

-0 ^ det(g)''V', 

under the {r'^} representation of GL{k). This should be compared directly to ^T^ . 

Taking 9Jl(fc) C GLi{k), we can construct Markov invariants by identifying polynomials lying 
in the invariant ring for GLi(fc). Clearly, any polynomial 

must also satisfy (jH]) and is hence a Markov invariant. Recalling the salient subgroup chain ^TU\\ . 
affiliated to the Markov semigroup, the representation-theoretic task is to evaluate the relevant 
branching rules for specific cases. The required branching rules derive from (|14p . together with 
identification of the specific form of one-dimensional representations of the subgroup in question. 
It should be noted that this procedure leaves open the possibility that there exist Markov 
invariants that do not occur in the invariant ring for GLi{k). We leave this as an open problem, 
but note that it is plausible that such a possibility could be excluded by continuity arguments. 

3.5 Markov invariants, existence theorems 

Presently, we use the facts we have collected above to establish existence conditions for polynomial 
invariants for the group actions of GL{k), GLi{k) and GLuik). 

Theorem 1: Polynomial invariants for phylogenetic models. 
Linearly independent polynomial invariants at degree d of the groups: 

i. x'^GLik), 

ii. x™GLi(fc), and 

iii. x™GLi4(fc), 

are given by the one-dimensional modules of these groups occurring in the decomposition of the 
GL{k"^) module ((g)™^)'^'^^. In each case the one-dimensional modules correspond to ?7i-fold 
products of Schur functions labelled by partitions of d: 

i. {r''} (g) {r''} ® . . . (g) {r''}, 

ii. {ri + si, r^~^} ® {r2-|-S2, r^"^^} . . . ® {r„, + s^, r^^}, and 

iii. {ri + si,r'l~^,ti} (g {r2-f S2, r2~^, ^2} ® . . . ® {rm + s,n,r'^n^ ,t„^}, respectively, 

with 

kr = d, 

kva + Sa s d, and 
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{k - l)rb + tb + Sb = d, 

for all 1 < a, 6 < 771 respectively. 

Given the isomorphism ([TS]) and the branching rule (|14p with {A} = {d}, in each case the 
number of admissible partitions of the given forms {cti} {ci} ® . . . ® {(Tm} is the number of 
times the inner product {cti} {a2} ... {dm} of irreducible representations of the symmetric 
group &d contains the one-dimensional irreducible representation {d}. This is also the number of 
linearly independent polynomial invariants in each case. 

Proof: Each case identifies representations of x'"GL(fc) with character {cti} {<J2} ... {<Jm}, 
each component of which is a partition that corresponds to a one-dimensional representation of 
the respective subgroup. The dimension of this representation is the product of the dimension of 
each of the representations labelled by {(Ta}- Therefore the representation is one-dimensional if 
and only if, for each {cTq}, the corresponding representation is one-dimensional. 

For case (i), GL{k), as we showed in WSAl the representation labelled by {r''} is one-dimensional, 
providing an invariant of weight w = r. For case (ii), GLi{k), it is established in the appendix that 
the representation of GL{k) labelled by {ra + Sa, r^~^} contains a unique one-dimensional module 
under GLi{k). For case (iii), as will also be established in the appendix, GLi^i(fc) is isomorphic 
to GL{k—l) and the GL{k) character {ra + Sa,r'^~'^,ta} contains under branching to GL{k — l), a 
unique one-dimensional module with character {r^"^}. 

D 
Note that case (ii) is a special instance of case (iii), with ta — 0, and case (i) is a special instance 
of case (ii), with Sa = 0- This reflects the definition (fT6)) . 

Recall the inclusion 

m{k) cGLi{k)<iGL{k). 

It is clear that any invariant that exists for case (i), with w = r, or (ii), with w= T\=r2 = ■ ■ . = rm, 
is necessarily a Markov invariant, ([9]), with the particular form 

f{P) = (det Ml det A'h . . . det M„,rf{P). (18) 

In fj4] we will count occurrences of this type of Markov invariant for various cases of interest to 
phylogenetics; k — 2 to 4 character states and trees with m — 2 to 10 leaves. We will also briefly 
review the algebraic structure of these invariants in the cases m = 2 to A when evaluated upon 
phylogenetic tensors, and give examples of how this structure can be gainfully employed in the 
problem of phylogenetic tree inference from molecular sequence data. 

Taking case (ii) in its general form, we see that for wi= ri,W2= r2, ■ . ■ Wm = I'm, it is possible 
that there exist Markov invariants, taking the general form 

f{P) = (det A'fi'"^ det M^^ ... det M^" ) /(F). 

When the distinction is required, we refer to these invariants as mixed weight Markov invariants. In 
ij4.3l we will show that such invariants do indeed exist in various cases of interest to phylogenetics. 
However, as yet the explicit form of these invariants has not been constructed, and their structure 
remains unexplored. 
Recall the inclusion 

m*{k) c GLis{k) c GL(fc), 

for the doubly-stochastic Markov semigroup. The case (iii) establishes existence conditions for 
polynomial invariants for this semigroup. These invariants will be valid for any joint distribution 
on a phylogenetic tree which is constructed using only doubly-stochastic matrices. This includes 
oft-used models such as Jukes-Cantor, K80, K3ST and SYM [94]. We report the above theorem, 
but defer the exploration of the invariants in this case. 
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4 Markov invariants in phylogenetics 

In tj4.1l wc establish existence of Markov invariants relevant to phylogenetics for the cases of fc = 2 
to 4 character states, distinguishing between true Markov invariants and invariants which are valid 
for the full general linear group. In t J4.2l we report upon known algebraic relations between Markov 
invariants when evaluated upon phylogcnetic tensors for A: = 2 to 4 character states and for trees 
with 771 = 2 to 4 leaves. We also discuss the application of Markov invariants to the problem of 
phylogenetic tree reconstruction in these cases. Finally, in ij4.3l we establish existence of mixed 
weight Markov invariants for k — A character states and trees with m = 2 to 5 leaves. Throughout, 
we used Schur [91 for all non-trivial manipulations of Schur functions. 

4.1 Zoo of invariants and nomenclature 

We gave, in W3AI a sufficient condition for the existence of a Markov invariant, ^TE\i . of degree d 
and weight w: 

{r + s, r-'^-i} Q{r + s, r''-^} . . . {r + s, r'"^} B {d}, (19) 

where r = w and the inner product is taken m times, subsequently written as 0™{r + s, r''~^}. 
For reasons discussed below, taking r = results in the trivial inner product: 

{s}Q{s}^{s}, 

for all integers s > 0. Extending to to > 2, 

©'"{^ = {4, 

and the corresponding Markov invariant is denoted as <i> with degree d=l and weight w — 0, and 
simply expresses the conservation of total probability under the action of the Markov semigroup: 

$(p) s J2 ^^1*--." = 1- 

ii,i2,...,im 

Here $ is the invariant corresponding to s = l and for s>l the invariant is simply the power $". 

For fixed tti, and any two invariants /, /' of degree d,d' and weight w,w'^ we can form the 
pointwise product / • /' which is itself an invariant of degree d+d' and weight w + w'. li w = w', 
we can form an invariant from the sum f + f- These statements establish that the invariants, 
V{V)'^, form a graded ring [47] (where the grading is over both the degree d and the weights w). 
In particular, it is important to note that we can increase the degree of any invariant (keeping the 
weight fixed) by multiplying it with the trivial invariant $. 

When searching for Markov invariants, we must note that the sufficiency condition P^ will 
include these powers, and hence in what follows we must allow for this over-counting. In the 
conclusions we will expand upon this observation with some comments in regard to classifying the 
ring of invariants. 

The general linear, or s = 0, case 

Recalling Theorem 1, we see that for s = 0, the Markov invariants are simultaneously invariants 
under the action of the general linear group. Taking r = l, the inner multiplication is trivial: 

{in0{in = W- 

This reflects that the Kronecker product of the alternating representation of 6^, associated with 
the partition (1 ), taken with itself, is the trivial representation, which in turn is associated with 
the partition (fc). Recall that the alternating representation is one-dimensional whose action on 
C dcflned as multiplication by +1 if cr is an even permutation and is —1 otherwise. For this 
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274 
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85 


820 
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83917 
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1388857 
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2461 


2461 


504013 


8824 


13888996 



Table 1: Occurrences of {d} in Q™{r + . 



„fc-i 



} with rk + s = d 



one-dimensional representation, the Kronecker product is simply the numeric product, with the 
result being the trivial representation where every permutation is mapped to +1. Similarly 

{in0{fc}-{in, 

and we see that there exists a single Markov invariant of degree d=k and weight w — l for all even 
values of to,. 

A very familiar example occurs for to = 2 where, as we will discuss in ij4.2i the invariant arises 
as the Log-Det distance function [75]. In the next case, to. = 4, we refer to the corresponding 
Markov invariant as the quangle. 

Considering m — 2 and r = 2, we have 

{2'=} {2^} 9 {2fc}, 

for 2 < A; < 4. For each k, these invariants can be accounted for by taking the previous invariant 
and multiplying by <&. Thus nothing new is gained. 

However, taking to, = 3, it follows that there exists an invariant of degree d = 2k and weight 
w = 2: 

{2^=} {2'=} {2^-} 9 {2fc}. 

For fc = 2 this invariant is known in the quantum physics literature as the tangle \15\ I16j , where it 
is drawn upon to classify entanglement in 3-qubit systems, and has been generalized for A; = 3 and 
4 in the context of phylogenetics in [84j. In >j4.2l we will briefly review the most striking properties 
of the tangle relevant to phylogenetics. 



Bona-fide Markov invariants, s>0 

Here we consider the case s > 0, where the resulting Markov invariants are not simultaneously 
valid for the general linear group. In Table [T] we present the number of weight w — l invariants 
that exist for the cases fc = 2, 3, 4; m = 2, 3, . . . , 10 and s = 1, 2. All required computations were 
performed using Schur, and we have not reduced for over-counting. In Table[2]we summarize the 
Markov invariants for which we have successfully computed explicit polynomial forms. Here we 
also record the nomenclature we have developed. Presently we discuss the particular properties of 
these invariants when evaluated on phylogenetic tensors derived from a tree. 

4.2 What happens on a phylogenetic tree? 

By definition, the expectation value of a (bias corrected) Markov invariant, /, depends only upon 
the internal part of the phylogenetic tree: 

E[f{Z)] =. f{P) = (det Ml det Ah . . . det M^Y f{P), 
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03(22} 3 {4} 
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x3GLi(2) 


(3,1) 
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x4GL(4) 
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Q' 


04{21} 3 3{3} 


x4GLi(2) 
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04{212}3 4{4} 


x4GLi(3) 


(4,1) 






04{2l3}3 4{5} 


x4GLi(4) 


(5,1) 



Table 2: Markov invariants of degree d and weight w for m leaves, 



where Z is the observed counts of character patterns, P is the phylogenetic tensor corresponding 
to the joint distribution on the tree, and the trimmed tensor P, defined in ^^ is formed by 
setting the lengths of the pendant edges to zero. It is exactly this property that can be exploited 
in the practical setting of reconstructing phylogenetic trees from molecular sequence data. 

As discussed in the closing comments of §3.21 Markov invariants exist independently of any no- 
tion of a tree, and to uncover their potential use in the problem of phylogenetic tree reconstruction 
it becomes necessary to analyse their structure on particular trees. Crucial to this examination is 
the generalized pulley principle presented in [84j , which establishes that the family of probability 
distribution resulting from taking the general Markov model on a particular tree is unchanged 
under arbitrary placement of the root of the tree (see [T] for an equivalent discussion). Thus, our 
task is to search for algebraic relations between the Markov invariants valid for a given m, when 
evaluated upon the trimmed phylogenetic tensors corresponding to particular trees with m leaves. 
We are free to place the root arbitrarily, and we choose to evaluate the Markov invariants on trees 
where the root is located to our convenience. 

In Appendix |B] we present the general procedure for computing the explicit polynomial form of 
Markov invariants using the Young's operators (i i3.2p associated with the relevant partitions. Our 
general procedure was to take these explicit forms and then search for algebraic relations when 
the invariants are evaluated on the pruned tensor P defined by a particular tree. In the general 
case, any such relations potentially lead to phylogenetically informative statistics, valid under a 
general model of sequence evolution. Presently we will report upon this procedure in the known 
cases, TO = 2, 3 and 4. 



The simplest Markov invariant: the Log-Det 

Recall that the generic phylogenetic tensor on m = 2 leaves (Figure [1]) can be written in the form 

P = {Ml Ma) • (5 • tt). 

The corresponding trimmed tensor, P — S-tt, can be expressed in the 61,62, ■■■ ,Sk basis with the 
components 

As we showed above, there exists a single Markov invariant for m~2. The polynomial form of this 
invariant is easily derived by considering rank 2 tensors as matrices, and taking the determinant. 
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Figure 6: Phylogenetic tensor for the tree (1,23) 

Since the invariant is a function on tensors, we make the distinction by using a capital letter 
and denoting the invariant as Det. This distinction can be compared directly to the use of the 
determinant function in [J as opposed to the use in |76j . 
Substitution gives 

Det(P) = Y[ TT,, 

l<i<k 

such that, by the definition of Det as a Markov invariant, 

Det(P) = det(Afi)det(M2) J| tt^. ^20) 

l<i<k 

This form holds for any fc, and is exploited by taking the logarithm and computing the Log-Det 
distance measure [EH [SI] . 

Triplet distances: the tangle 

Inspection of Table [T] reveals that for m — 3 and s — there exists a Markov invariant, for each of 
k — 2, 3 and 4, of degree d = 2k and weight w — 2. This invariant is valid for phylogenetic trees 
with three leaves. For each of fc = 2, 3 and 4, the explicit polynomial forms of the tangle are basis 
independent (by definition) and have 12, 1152 and 431424 terms respectively. 

The generic phylogenetic tensor on the three leaf tree (Figure ^ can be expressed as 

P = (1 ® Ma M3) • (1 ® (5) • (Ml ® M4) • {S ■ tt). (21) 

The trimmed tensor, P = (1 (5) • (1 M4) • {6 ■ tt), has components 

P'h'h'h — Piii2"i2i3^ (22) 

where P* = (1 M4) ■ {S ■ tt) is the pruned tensor. 

The tangle is a Markov invariant and hence satisfies 

T{P) = (det Ml det M2 det M^fTiP). 

By explicit computation we have found that, for each of fc = 2, 3 and 4, 

r(P) =Det^(P*). 

Thus we see that the induced function of the tangle is T* = Det^. This is the example we promised 

in gm 

Consistent with ([^0]) we have 



Det(P*) =detM4 ]J tt. 



, i<i<fc 
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so, finally, we see that 

T(P) = (detA^idetM2detM3detM4)2 H ''M ' ^^^^ 

\l<i<fe / 

Due to the generalized pulley principle, (|23p holds for the phylogenetic tensor corresponding to 
any tree with three leaves. Comparing directly to (pO)l it is clear that the tangle may be used 
similarly to the Log-Det pairwise distance but for triplets of molecular sequence data. For further 
details in this direction see [51] , 

Informative statistic: the stangle 

We see from Table [T] that for m = 3 and s = 2 there also exists, for each fc = 2, 3 and 4, a weight 
w=l Markov invariant valid for trees with three leaves (of degree d = Q for fc = 4 states). We refer 
to this invariant as the stangle, that is, the stochastic tangle (see [81) for explicit expressions for 
the fc = 2 and 3 cases). As discussed in Appendix iBl the explicit polynomial form of the stangle for 
fc = 4 is known only in a basis different from the standard 81,82, ■■■ ,5^- In this basis, the stangle 
has 1404 terms with relevant data files available on Charleston's website [82]. This does not, 
however, prevent us from using the stangle in a practical setting as evaluation can be performed 
in this basis by transforming the data set (pattern counts) into the required basis. 

For the trimmed tensor with components given by (|22p . explicit computation shows that the 
stangle satisfies T^{P) = 0. Thus the stangle is simultaneously a phylogenetic invariant for a tree 
with three leaves (of course this again holds for any tree with three leaves) . 

Given an unbiased estimator, T^ , of the stangle, we see that under the family of probability 
distributions described by ((2T|) . the expectation value of this estimator when evaluated on triplets 
of aligned DNA sequences is zero : 

Elf'iZ)] = 0, 

where Z is the tensor of observed pattern counts in the aligned sequence data. Deviation from 
zero by the observed value of the stangle can thus be viewed as evidence that the data set violates 
the assumptions of the Markov model. We have had some preliminary (unpublished) success 
capitalizing on this property to rank subsets of aligned molecular sequences according to apparent 
concurrence with model assumptions. 

Note that the stangle must occur within the framework of phylogenetic invariants presented in 
[T] and the discussion of 53J. It would be interesting to determine whether the stangle is a linear 
combination (with coefficients that are d—\ polynomials) of the quintic phylogenetic invariants 
presented in [79]. However, whether or not this is the case is beyond the theoretical techniques 
presented in this paper and more work needs to be done before the precise connections between 
the stangle and the known phylogenetic invariants for this case become transparent. Further, 
because the explicit polynomial form of the stangle in the standard basis is not known, brute-force 
determination is impractical using algorithms presently known to the authors. 

Quartet inference: the squangles 

Inspection of Table [T] reveals that for k — A and m — A, there exist four Markov invariants of degree 
(i = 5 and weight w — 1 relevant to phylogenetic trees with four leaves. We refer to these invariants 
as the squangles. Again, the explicit polynomial form of the squangles is known only in a basis 
different from the standard one, and data files can be found on Charleston's website [82]. We have 
found that three particular linear combinations of the squangles are tree informative. Here we 
denote these three squangles as Q\,Qi and Qj,. In the non-standard basis, Q\ has 77004 terms, 
whereas both Q2 and Q3 have 91620 terms. 

On the quartet tree in Figure [T] the generic phylogenetic tensor is 

P ^ (Ml M2 «) M3 (g) M4) -{8® 8)- {Mr, (g) Me) ■ (8 ■ n). 
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Figure 7: Phylogenetic tensor for the tree (12,34) 

TT 




Figure 8: Phylogenetic tensor for the tree (13,24) 



The trimmed tensor P — {6^5)- (M5 (g) Mq) ■ {S ■ tt) has components: 

P — P* . S S 

with the pruned tensor given by P* = (M5 (g) Mq) ■ {S ■ tt). This form of the trimmed tensor can 
be evaluated directly on the explicit polynomial form of the squanglcs. We found that on the tree 
(12, 34) the squangles satisfy the algebraic relations: 

Qi(i5)^0, Q2{P) ^ -QsiP) > 0, 

with, intriguingly, the polynomial form of Q2{P) with respect to the components Pi-^i^ taking that 
of the vermanenr\ which, unfortunately for the phylogenetic context, is not a Markov invariant. 
An identical procedure was carried out on the phylogenetic tensors corresponding to the trees 
in Figure [8] and Figure [H This produced the relations 

Q2{P) = 0, Qi{P) = Q3(P) > 0, 
Q3{P) = 0, -Qi{P) = -Q2{P) > 0, 

respectively. 

Noting these relations, we see that we have constructed tree-informative phylogenetic invariants 
for trees with four leaves. In particular, for the unbiased estimators thereof, we have 



E[Qi{Z)]=0, E[Q2{Z) + QsiZ)] = 0; 



for the tree (12,34), 



E[Q2{Z)]=0, E[Qi{Z)-Q3iZ)]=0; 



for the tree (13, 24), and 



E[Q3{Z)] = 0, E[Qi{Z)-Q2{Z)]=0; 



^ ^ The permanent has identical algebraic form to the determinant of a matrix but with each term replaced by its 
absolute value. 
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Figure 9: Phylogenetic tensor for the tree (14,23) 

for the tree (14, 34). We also note that the Hnear combination 

W:^Qi-Q2-Q3, 

satisfies E[W{Z)] = 0, for any phylogenetic tree with four leaves. 

The bar charts in Figure [TUl compare the success of three tree inference methods tested on data 
sets created using Hetero [33]. All parameter settings used are as presented in 03] with sequence 
length A''=10000 and 10000 runs being completed in each case. A molecular clock was imposed, 
and for each run the tree used to simulate the data was 

treel == {{Seql : 0.495, Seq2 : 0.495) : 0.005, {SeqS : 0.495, SeqA : 0.495) : 0.005), 

with branch lengths given in time units and t = 0.495 and .005 corresponding to 0.1485 and 0.0015 
expected number of state changes, respectively. The G+C content was made to increase in leaves 
1 and 4 and was reduced in leaves 2 and 3. This tends to bias tree inference to tree3= (14, 23) 
as sequences 1 and 4 will tend to be more similar purely because of the G+C content. 

The Maximum Likelihood and Log-Det+NJ quartet inferences were performed using the default 
settings in Phylip ^3}, whereas the Log-Dct+BIONJ inferences were performed using the R [7T] 
package "ape" [68j . Finally, the squangles inferences were implemented in R using our own original 
code [12]. For the purpose of making a rough comparison, on average each evaluation took .58s 
for maximum likelihood, .036s for Log-Det+NJ, .090s for Log-Det+BIONJ, and .085s for the 
squangles. The squangles routine was designed for illustrative purposes only and was performed 
under simple statistical assumptions, as follows. 

The squangles were taken to be stochastically independent and normally distributed, with 
identical variances, ct^, and mean values set to or u > 0, depending on the quartet under 
consideration and the expectation values given above. That is, for each quartet in turn, we took 

P[Qi, Q2, Q3|(12, 34)] ~ AA(0, a^) * M{u, a^) * A/'(-m, a^), 

P[Qi,Q2,Q3|(13,24)]^AA(u,a2)^AA(0,CT2)*AA(u,a2), 

P[Qi,Q2,Q3|(14,23)]~AA(-u,a2)*AA(-u,a2)*AA(0,a2). 

Our primary scientific justification for these assumptions is that the resulting quartet inference 
routine performs rather well. 

Under these assumptions the maximum likelihood estimate (MLE) of u is independent of cr'^, 
and is equivalent to the least squares estimator. Analytic solutions are easily derived: 



MLE [u| (12, 34)] = max 
MLE [m| (13, 24)] = max 
MLE [m| (14, 23)] = max 



0, 



0, 



Q2{Z)-Q3{Z) 



Qi{Z) + Q3{Z) 



-{Qi{Z) + Q2{Z)) 
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Figure 10: Quartet reconstruction using the squangles. The charts present how many times 
the treel=(12,34), tree2=(13,24) and tree3=(14,23) were reconstructed using each of the three 
methods displayed. The tree used to simulate the data was treel. 

For each data set and candidate quartet, we computed the MLE for the mean value u and chose 
the quartet with the maximum likelihood. 

While our demonstration is not intended as an exhaustive comparison between the performance 
of our method and ML using the default settings of Phylip, it does show that using a stationary 
model for ML can lead to incorrect tree inference if the data was produced by a non-stationary 
process. With that caveat, it is clear that ML performs very badly as the G+C content increases, 
strongly favouring treeS. The Log-Det routine is robust against varying G+C content as the 
technique is based on a general model (this is consistent with what was found in [44 ) . Being valid 
for a general model, the squangles are also robust against varying G+C content, and actually 
perform slightly better than Log-Det. 

Interestingly, as the G+C content increases, the Log-Det and the squangles infer the true 
tree more often. Careful inspection reveals that this is because, as the G+C content increases, 
both these techniques tend to infer tree2 less often and treeS at approximately the same rate, 
favouring treel. This effect is more pronounced for the squangles. 

4.3 Mixed weight Markov invariants 

Here we report upon the existence of some mixed weight invariants for various cases of interest to 
phylogenetics. The polynomial form and algebraic structure on trees of these invariants remains 
completely unexplored. 

We concentrate on fc = 4 and look for mixed weight invariants for the degree d — 8 partition 
shapes {2^} and {51^}, corresponding to s = and 4 respectively. 

In the m = 2 case, we find that 

{2^} {513} 

does not contain {8}, which means there does not exist a mixed weight invariant for trees on two 
leaves. 

In the m — 3 case, we have 

{24}0{5l3}0{5l3}9{8}, 
{24}0{24}0{5l3}3{8}. 
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Writing w — (wi, W2, 103), we see that, including the three possible permutations across the inner 
products, there exist mixed weight invariants for the cases w — (2, 1, 1), (1, 2, 1), (1, 1, 2) and w ~ 
(2, 2, 1), (2, 1, 2), (1, 2, 2) respectively. 
In the m^A case, we have 

{2^} {513} Q |5^3| Q |5^3| 3 ^4|8|^ 

{24}0{24}0{5l3}0{5l3}9 9{8}, 

{24}0{24}0{24}0{5l3}E5 4{8}. 

Taking account of the permutations, we see that there exist 14x4 = 54, 9x6 = 54 and 4x4 = 16 
mixed weight invariants for the cases w= (2, 1, 1, 1), w = (2, 2, 1, 1) and w~ (2, 2, 2, 1) respectively. 
Finally, in the m~5 case, we have 

{2*} {51^} {51^} {51^} {51^} 9 527{8}, 
{2*} {2^} {513} |5^3| Q |5^3| 3 212{8}, 
{2^} {2*} {2*} {51^} {51^} 3 90{8}, 
{2^} {2*} {2^} {2^} {51^} 3 46{8}. 

Again taking account of the permutation, we see that there exist 527x5 = 2635, 212 x 10 = 2120, 
90 X 10 ~ 900 and 46 x 5 = 230 mixed weight invariants for the cases w = (2, 1, 1, 1, 1), w = 
(2,2,1,1,1), w = (2, 2, 2, 1,1) and w = (2, 2, 2, 2,1) respectively. 

We expect that a future analysis of the explicit form of these invariants will lead to quite an 
array of informative statistics for phylogenetics. 

5 Discussion 

In this work we have defined and explored the construction of 'Markov invariants'. The primary 
tool exercised was group representation theory, applied to the usual Markov process present in 
probabilistic models of phylogenetic trees. 

It is evident that our present approach to phylogenetics offers many possibilities for further 
study. The various Markov invariants that we have identified and constructed provide strong 
candidates for improved tree estimation and parameter recovery under general model assumptions. 
In particular, the stangle f ij4.2p seems to provide a robust indicator of phylogenetic signal in subsets 
of aligned molecular sequences. Efforts are underway to incorporate the stangle into a clustering 
algorithm that provides the means to divide large phylogenetic data sets into smaller, manageable 
parts, inspired, in part, by the Disk-Covering technique of [37j. In §4.2( we presented a Markov 
invariant based quartet inference technique. The maximum likelihood estimation we employed 
was based on rather simple statistical assumptions, and it is clear that this technique could easily 
be improved upon. Detailed knowledge of the invariants' distribution is desirable not only in order 
to achieve correct tree inferences, but also to find confidence intervals (as [61] do for Log-Det and 
ML distances). In all its glorious detail, the joint distribution of the squangles can be derived 
using the multinomial distribution of Z: 



P[Qi(Z)-gi,Q2(^)-92,Q3(^)-93] ^J2^[Z = z-N] 




zer 

where the summation is over the variety T :—{z\Qi(z) — qi — Q2{z) — q2 — Q?i{z) — q3 = Q} ■ However, 
this distribution depends implicitly upon the model parameters underlying /i, and therefore negates 
the whole point of employing invariants in the first place! Clearly, a more coarse grained approach 
is desirable for intuiting an approximate distribution for the invariants that depends on just a 
few shape parameters. This can be achieved variously by studying the relevance and impact of 
the central limit theorem, deriving the first few moments using the generating function ([2]), or 
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conducting extensive simulation studies. This would help to provide rigorous justification for 
taking the invariants as normally distributed, as wc did for the squangles in m.2\ 

Citing poor performance on short sequences >33, 35J, there is a somewhat popular opinion that 
phylogenetic invariants are of limited utility when it comes to phylogenetic inference in practice. 
However, recent work suggests that this performance can be greatly improved by identifying 
"powerful" invariants [17 . For instance, |12j chose invariants for the K3ST model using a criterion 
arising from algebraic geometry, and |18j used a learning algorithm to choose invariants for the 
K3ST and Jukes-Cantor models. However, determining criteria that guarantee identification of 
statistically powerful invariants is in general an outstanding problem. In this context, we have 
shown clearly that Markov invariants can be of significant practical utility. For instance, one need 
only note that the simplest Markov invariant forms the structure of the Log-Det distance measure, 
an extremely popular tool employed in countless phylogenetic studies, while the simulation study 
we presented in M.2\ shows that Markov invariants can be used to infer quartet phylogenies with 
a success rate equivalent to, or greater than, popular methods. For phylogenetic invariants that 
arise as Markov invariants, it would be interesting to determine whether the additional analytic 
structure imposed by group invariance provides an effective criterion for identification of powerful 
invariants. 

Markov invariants occur as one-dimensional representations of a group action associated with 
the Markov semigroup. In this regard, we applied only a particular instance of the group branching 
rule HH) that requires each of the irreducible modules to be one-dimensional. The standard 
approach to maximum likelihood exploits the trivial instance of the same branching rule with 
{A} — {cTi} — W2} — ■ ■ ■ — Wm} = {1}, taking m copies of the fc-dimensional defining representation 
to obtain a ^'"-dimensional and degree d= 1 polynomial representation. From this perspective, 
the standard approach and the Markov invariants are simply two cases where the transformation 
properties of polynomials of molecular sequence data under the time evolution ([5]) are exploited. 
This begs the question whether there exist polynomial representations, of dimension other than 
these two extremes, that can also be effectively utilized in practical phylogenetic tree inference. 

Many of the different classes of phylogenetic models [36] can be affiliated with appropriate 
subgroups of GL{k), and can therefore be expected to have a place in the subgroup chain ([TO|l . In 
principle we can modify Theorem 1 f ti3.5p for each of these models and construct their associated 
Markov- type invariants. In this vein, Appendix[C]outlines a group-theoretic analysis of the Kimura 
3ST model connecting the Hadamard conjugation with the construction of the Cartan subalgebra 
for this model. The same considerations apply in principle to amino acid sequence models: this 
is simply a matter of setting k — 20 and using the same theory, though the computations involved 
will of course be more lengthy. 

As noted in H'6.2[ a representation-theoretic analysis of the space of phylogenetic tensors that 
includes the underlying tree structure has not been developed in this work. Ideally, for a given 
tree, one would like to obtain the structure of the ring of Markov invariants as a theoretical 
outcome, rather than obtain this structure using the post-hoc procedure presented in §4.21 A 
possible direction in this regard is to consider, for each tree with m labelled leaves, the subgroup 
of &m induced by identifying permutations that leave the leaf labelling invariant. This subgroup 
is discussed in ^S] Chap. 2] and, in a different context, in [6l Chap. 12, Topic 3]. We conjecture 
that this group may play a role, analogous to that of the symmetric group for the Schur-Weyl 
duality, in the construction of the irreducible modules for the space of phylogenetic tensors. 

The phylogenetic invariants form an ideal in the associated polynomial ring, and hence Hilbert's 
basis theorem for finite-generatedness applies. However, whether Markov invariants are finitely- 
generated is unknown. Technically, the group GLi(k) is non-reductive (its finite-dimensional 
representations are not completely reducible). In the non-reductive case, standard theorems, such 
as finite-generatedness of polynomial invariant rings, do not apply. Thus it is unlikely that the 
Markov-type invariants will be finitely generated in general. A notable exception is provided 
by Weitzenbock's theorem [73l [86] for finite-dimensional representations of one-dimensional Lie 
groups. Continuing the subgroup chain PH)) to its natural limit, it follows that in the case of 
phylogenetic tensors, the group GLi{k) provides, on restriction, an indecomposable representation 
of the additive group R+ (corresponding to time evolution). Thus, Weitzenbock's theorem is 
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relevant to the analysis of Markov invariants in the current context. In fact, this observation is 
pertinent to phylogenetic invariants for continuous time models, as they would occur as syzygies 
[661 Chap. 2] between invariants belonging to the generating set of the invariant ring for the 
representation of K+ in question. 
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A Proof of Theorem 1 

We provide a tensor-based completion of the proof of Theorem 1 fi i3.5p . regarding the identification 
of one-dimensional irreducible representations of the groups GL{k), GLi{k) and GLi^i(fc). 

Following the notation of Sj2l a probability measure can be written in a basis of point measures, 
fj, = X]i<i<A: ^^i^i^ with the Markov semigroup acting as 

l<j<k l<i<k l<j<k 

Moreover, probability conservation requires the column-sum condition X^i<i</j-^^ij = 1, for all 
^ ^ j < k. As discussed in Sj3l this affiliates the linear transformations M G DJl{k) with the 
subgroup GLi{k) <l GL{k). Correspondingly, a higher rank tensor, '0, transforms under the action 

of g e GLi{k), tp ^ V', with 

Y ni2J3--- ^ 2^ 9iljl9i2J29i3J3 ■ ■ ■'rjlJ2J3--- ■ 

In order to find combinations oi ipi-^^i^i^i^,,, which remain invariant up to scaling under the GLi{k) 
action, we transform to a more convenient basis in which the distinguished role of the vector 
(l,e^) = (1,1,... 1) is identified. Following [55], define a nonsingular k x k matrix, X, with 
1x1 + (fc— 1) X (fc — 1) block decomposition: 

A := f ^ ^^ V (A-2) 

Lemma: With respect to the similarity transformation g ^^ g = XgX^^ defined by any fixed X of 
the above form, GLi{k) is isomorphic to the afhne group A{k) ^ GL{k—l) k T{k—1). Furthermore, 
under the same mapping subject to the constraint 77 = —x ■ e, GLi,i{k) is isomorphic to the group 
GL(fc-l). 

Proof: Check explicitly that 

if g = f ,^ ^n , then A5A-1 = f ~ !^ 

using the column-sum condition on g. Clearly, dett; = detm, so m G GL{k — l) for all such X. 
Finally, if ^2 = 1 — w • e, A = 1 — ^7 ' ^ and 77 — —x ■ e, then ^ = in XgX^^ and g £ GLi_i(fc) is 
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thereby identified with the GL{k — l) subgroup of GL{k) consisting of matrices in block form as 
displayed. D 

It is convenient to re-label the basis as XSi -.= 60 = 61 + 62 + ■■■ + Sk, X6a ■= 6a, a^= 2,3, ... ,k. 
In the new basis, probability measures will transform mhomogeneously, with the 60 components 
invariant; for example mimicking (|A-1[) 



Mo=MO, ^i'a=ia^J.O + y^mabfJ'b, (A-3) 



E 

fc=2 



and in this way we can deduce the transformation properties of higher-rank tensors. 

As we discussed in §3.21 the finite dimensional irreducible representations of GL{k) associated 
with partitions A, are realized by tensors of rank |A| whose indices satisfy particular symmetrization 
conditions to be outlined in Appendix [B) symmetrize across the rows and then anti-symmetrize 
down the columns of the associated standard tableau T. Conventionally, for example, we write 
for such a tensor the components '/'[iii2.-][iii2--][--]- H^re the indices enclosed in braces [...] 
are mutually anti-symmetric, corresponding to column entries in T, and there are further cyclic 
identities (we need not consider) reflecting row dependencies of ip. 

Below we will discuss properties of such tensors in the 6^,62, ■. .5^ basis under the transforma- 
tion (|A-3p . The crucial result will depend absolutely on the indices, and the symbol '■0' will be 
superfluous. Hence, for ease of reading we will suppress the '■0': 

Aili2-]Ulh-][-] = [«1*2 •■■][jlJ2-- •][•■■]■ 

This is consistent with the amusing comments in the preface of |63| . 

Consider the reduction of an irreducible representation A of GL(k) with respect to the subgroup 
GL{k—l) (equivalent, by the Lemma above, to considering the restriction to GLi_i(/c) afHliated 
to the doubly-stochastic Markov semigroup) . The partition labels A of irreducible representations 
of GL(k — \) arising from this restriction are related to those of A by the standard betweenness 
conditions [HI Chap. V, §18] (see also [Il|88]): 

Ai > Ai >... > A„-i > A„. (A-4) 

Our present purpose is to identify one-dimensional representations of GL{k—l), that may extend to 
one-dimensional representations of GLi(fc) = GL{k—l)t<T{k—l). Such tensor representations must 
be associated with partitions A = {r'^~^ ) all of whose columns have length k — 1 corresponding to 
the r^^ power of the representation M 1-^ detM. However, for such a A, (|A-4|) above immediately 
implies that 

A = (r + s, r^'"^ , t), for some s > 0, t < r, 

and we have established part (iii) of Theorem 1. 

Within such tensor representations of type (r -I- s, r^~'^ , t), the component associated with the 
scalar representation (r*^^^) of GL{k — l) is clearly 

[0ai2 . . . aifc] [Oa22 . . ■ a2fe] ■ . ■ [Oat2 ■ • . atk] [&11&12 • • • ^i^fc-i] ■ ■ • [br-t.ibr2 ■ ■ ■ ^r-t,fe-i]0i02 . . . 0^. 

However, under the inhomogeneous group transformations (|A-3p with fhab = 6ab, ^a 7^ 0, we have 

[Oai2 . . . aik] [0022 . . . a2k] ■ ■ • [0at2 ■ ■ ■ atk] [biibi2 . . . 6i,fe-i] . . . [br-tsbr2 ■ ■ • br-t,k~i]0i02 ■■■Os 

[Oai2 ■ . . aik] [0a22 • • . a2k] ■ ■ • [00^2 . . . atk] [&11&12 ■ • ■ 6i,fc-i] ■ • ■ [br-t,ibr2 ■ ■ . br-t,k-i]0i02 . . ■ Os-t- 
£ai2 [00 . . . aik] [0a22 . . • a2k] ■ ■ . [Oat2 ■ • . atk] [&11&12 • • • bi^-i] . . ■ [br-t.ibr2 ■ ■ ■ ^r-t,fe-i]0i02 . . . 0^ + . . . 
-I- Ibii [Oai2 . . . aik] [0a22 ■ ■ . a2fc] ■ ■ • [0at2 . . ■ atk] [O612 . . . bik-i] ■ . . [for-ta&r2 • ■ • 6r-t,fc-i]0i02 ...Os 
+ 4i2 [Oai2 . . . aik] [0a22 ■ ■ . a2k] ■ ■ • [0at2 . . ■ atk] [biiO ■ . . 61,/c-i] • ■ • [br-t,ibr2 ■ ■ ■ br-t,k-i]0i02 ■■■Os 
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wherein the coefficients of the £q terms vanish by anti-symmetry, but those of the £{, terms 
clearly do not. The components corresponding to the desired A = {r^~^) one-dimensional rep- 
resentation of GL{k — \) within \ = {r -\- s,r''~'^,t) is therefore not invariant under inhomoge- 
neous transformations corresponding to translations in GL{k—l) k T(k~l) = GLi{k) unless the 
[61^1 . . . br-t.k-i] columns are absent, that is, t = r. Thus, the requirement of invariance of the 
one-dimensional representations under GLi(fc) necessitates A = {r + s,r''~^) as claimed in part (ii) 
of Theorem 1 . 

B The construction of Markov invariants 

The standard construction of the irreducible modules V^ is given, for example, in [241 Lecture 4]. 
Here we modify this procedure, to give the explicit polynomial form of the Markov invariants. 

Consider the representation of 6m on (E)™V defined by the action vi (E) V2 (E) ■ ■ ■ Vm '-^ i'a(i) ® 
''^a{2) fX" . . . (E" Vaim) ^OT all a G S^. Givcn a standard tableau T with shape A and |A| = m, 
define the permutations p G 6„i as those that interchange the integers in the same row, and the 
permutations q G S,„ as those that interchange numbers in the same column. In the algebra of 
the representation of the symmetric group whose action is defined above, consider the quantities 

pel 
and 

The Young's operator corresponding to T is then defined as 

Y^ = BA. 

It follows that for a standard tableau of shape A, the corresponding Young's operator projects 
onto an irreducible module of GL{k): 

V^ = Y^ ■ «)"y". 

This construction is independent of fc, and Young's operators corresponding to standard tableau 
of the same shape project onto equivalent modules. The independent tensor components of these 
irreducible modules are found by inserting integers from semi-standard tableaux into the indices 
of the generic tensor. To compute the explicit form of Markov invariants, we must apply this 
standard procedure to our special case. 

Begin with the generic form of a monomial in the components of the tensor ■06®™^: 

To find the polynomial form of an invariant that arises from an inner product of Schur functions 
{fTi} {C2} ... {cm} with (Tq = {?' + s, r''^^} for all 1 < a < m, and rk-\- s = d, we must apply 
the Young's operators to these indices. In an abuse of notation we write 

where each Young's operator 1"°"", 1 < a < m, is generated from a standard tableau of shape 
{(7a\ with integers chosen from the set {a, ttt, + a, . . . , (d — l)m -\- a}. That is, each y"'" permutes 
the indices Iq, ia+m, • • ■ , ia+md- The final step is to insert indices into ^ using the semi-standard 
tableau which results from filling the 1*** row with the integer 0, and, for 2 < i < k, the i*"^ row 
with the integer i. The justification for filling the "overhang" of length s in the first row of the 
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tableau with the integer 0, is that in the basis given in Appendix |^ the So component is an 
invariant subspace. For more details, including multiple examples, see [81 . 

This procedure has been implemented to garner the polynomial form of the Markov invariants 
for phylogenetic trees with up to four leaves. These are presented in Table [H The algorithms 
required were performed in Mathematica 90j, and, unfortunately, do not scale well for trees with 
more leaves. We are currently investigating the design of efhcient algorithms for this construction, 
and note here that 64] provides a promising direction. 

Additionally, in this construction the resulting polynomial form of the invariant is not in the 
61,62, ■■■ ,6k basis, and the required change of basis computation has thus far not been feasible. 
To evaluate the invariants on observed data, we therefore proceed by transforming the data itself 
into the appropriate basis. This allows us to evaluate Markov invariants on observed character 
pattern counts taken from phylogenetic data sets. 

The calculation of unbiased forms, as defined in ^ 12.21 is straight forward in principle. However, 
the calculation requires that the invariants be expressed in the 61,62, ■■■ ,6k basis. This appears 
to be a rather challenging computational task, and to date the required algorithms have not been 
developed. 



C Kimura 3ST model and phylogenetic invariants 

Our approach to phylogenetic models via group actions and representations finds specific applica- 
tion in some special cases, such as the Kimura 3ST |49) model and certain generalizations to be 
described below. Here we provide a brief discussion as an illustration of our focus. 
In the usual basis of point measures 6a,6c,6u,6g, the K3ST rate matrix Q, 



Qaa Qag Qau Qac 

Qga Qgg Qgu Qgc 

Qua Qug Quu Que 

Qga Qgg Qcu Qcc 

can be re- written [S], 

Q={a+l3+-i){-l + 



= -(a+/?+7)l + 



a /? 7 
a 7 /? 
/370a 
7 /3 a 



a + /3 + 7 



K^ 



a 



a + /? + 7 



Kfi^ 



where the three 'Kimura matrices' 



Kr. 



10 
10 
1 
10 



K„ 



10 
1 
10 
10 



K^ 



a + 13 + j^'' ' ' 



1 
10 
10 
10 



(B-1) 



(B-2) 



(B-3) 



span a Cartan (maximal commuting) snhalgebra of the group 5*^(4), and therefore can be diago- 
nalised simultaneously, via the well-known Hadamard transform |31| . 



H = h(^h-- 



HKbH-'^ = 



with 



^P 



h = 



1 


1 


1 


1 


1 


-1 


1 


-1 


1 


1 


-1 


-1 


1 


-1 


-1 


1 


" 1 














1 














-1 














-1 



HKr,H- 



HK-yH-^ = 



1 














-1 














1 














-1 


1 














-1 














-1 














1 



(B-4) 



34 



This simple observation means that under this model, rank-m phylogenetic tensors have a spectral 
resolution given directly in terms of weights of the appropriate x™{gl{l) x gl{l) x gl{l)) abelian 
subalgebra of x"'G'-L(4) (equivalently the weight decomposition of the corresponding representa- 
tion of X" 5*^(4)). 



In fact, a stronger statement is possible. The action of group elements of the form M{t) = e 



tQ 



turns out to be covariant with respect to the operator S introduced in 
branching in the general phylogenetic model - explicitly, in the notation of 



S ■ eyip{aKa + bKp + cK^) = exp(aiirQ, ® Ka + hKp ® Kp + cK^ K^) ■ S. 



above, describing 
2.41 we have 



(B-5) 



Applied to a phylogenetic tensor P with underlying arbitrary tree T, (jB-5|) then means that, 
under this model, the action of the Markov operators on each internal edge can be pulled back to 
the pendant edges, at the expense of a more complicated edge-mixing transformation. In final form, 
P is given by the action of a certain element of (GL(1) x GL{1) x GL(l))''™ within Gi(4™), with 
the embedding fixed by the tree, applied to the maximally branched product measure S^"^~^' ■ n, 
defined by 



S(^-^).tt^J2ttA 



(g)5i, 



with m tensor products in each ternj^H Further details can be found in [5] . 

Analyses of this sort are useful both analytically, and in explicit calculations. In particular, 
the identification of phylogenetic invariants for given trees becomes straightforward, once the 
components of P are written in the diagonal Hadamard basis. The group representation analysis 
provides a useful alternative to discrete Fourier transform methods which have been successfully 
applied where rate matrices admit a symmetry with respect to a discrete colour group, Z2 x Z2 x • ■ • 
[3ni32j . and may also be useful in the characterisation of phylogenetic varieties in the phylogenetic 
invariants analysis [21 [5D] (see also the discussion in ^ . 

The above considerations generalize to the case of any fc-state model wherein the off-diagonal 
part of the rate operator is a linear combination of a maximal set of commuting permutation 
matrices belonging to 6k, which guarantees (jB-5p . For example, this class would include a 3- 
state model even simpler than the K3ST model, but which is non-symmetric, and for which the 
Hadamard basis is complex: 



g=(a+/3)(-l + 



a + P 



-.K^ 



a 



,K, 



P) 



i^a = 



1 

1 
1 



Ka = 



a-l-/3 
"010 

1 

1 



(B-6) 



See [5] for further details 
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